BTS 510 Lab 6

set.seed(12345)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Stat2Data)
theme_set(theme_classic(base_size = 16))

0.1 Learning objectives

  • Describe the logic of linear regression
  • Describe the assumptions of linear regression
  • Briefly present results of linear regression
  • Use a regression equation to summarize a model
  • Compare and contrast observed, predicted, and residual values

0.2 Data

  • ICU data from the Stat2Data package
    • ID: Patient ID code
    • Survive: 1 = patient survived to discharge or 0 = patient died
    • Age: Age (in years)
    • AgeGroup: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+)
    • Sex: 1 = female or 0 = male
    • Infection: 1 = infection suspected or 0 = no infection
    • SysBP: Systolic blood pressure (in mm of Hg)
    • Pulse: Heart rate (beats per minute)
    • Emergency: 1 = emergency admission or 0 = elective admission

0.3 Tasks

  1. Does blood pressure differ between those with a suspected infection and those without? Fit a linear regression to answer this question.
library(Stat2Data)
data(ICU)
m1 <- lm(data = ICU, SysBP ~ Infection)
summary(m1)

Call:
lm(formula = SysBP ~ Infection, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-87.452 -18.672  -2.062  18.548 117.328 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  138.672      2.986  46.440  < 2e-16 ***
Infection    -15.220      4.608  -3.303  0.00113 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.16 on 198 degrees of freedom
Multiple R-squared:  0.05223,   Adjusted R-squared:  0.04744 
F-statistic: 10.91 on 1 and 198 DF,  p-value: 0.001134
  1. Write up the interpretation of the model, including:
  • Intercept value, test statistic, p-value
  • Intercept interpretation (in words)
  • Slope value, test statistic, p-value
  • Slope interpretation (in words)
  • R^2 value, test statistic, p-value
  • R^2 interpretation (in words)
  1. What is the predicted blood pressure for someone with a suspected infection? What is the predicted blood pressure for someone without a suspected infection?
  • Predict function
newdat <- data.frame(Infection = c(0,1))
predict(m1, newdat)
       1        2 
138.6724 123.4524 
  • Use R as a smart calculator
bp_0 <- summary(m1)$coefficients[1] + summary(m1)$coefficients[2] * 0
bp_0
[1] 138.6724
bp_1 <- summary(m1)$coefficients[1] + summary(m1)$coefficients[2] * 1
bp_1
[1] 123.4524
  • Use R as a dumb calculator
bp_0d <- 138.672 - 15.220 * 0
bp_0d
[1] 138.672
bp_1d <- 138.672 - 15.220 * 1
bp_1d
[1] 123.452
  1. Calculate the predicted and residual values for each person and add them to the original dataset. Does it looks like the assumptions are satisfied? Make plots to help you decide:
  • Residual vs predictor
  • Q-Q plot of residuals
  • Histogram of residuals (all together and separately for the two groups)
ICU$pred1 <- fitted(m1)
ICU$resi1 <- residuals(m1)

ggplot(data = ICU, 
       aes(x = Infection, y = resi1)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

ggplot(data = ICU,
       aes(sample = resi1)) + 
       stat_qq() +
       stat_qq_line()

ggplot(data = ICU,
       aes(x = resi1)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = ICU,
       aes(x = resi1)) +
  geom_histogram() +
  facet_grid(cols = vars(Infection))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Does blood pressure differ between those with a suspected infection and those without? How? (Brief, simple words, no jargon, no statistics.)

1 Plot.

int <- summary(m1)$coefficients[1]
slo <- summary(m1)$coefficients[2]
ggplot(data = ICU,
       aes(x = Infection, y = SysBP)) +
  geom_point(alpha = 0.3) +
  annotate("text", 
    x = 0.5, 
    y = 250, 
    label = bquote(~ hat(Y) == .(round(int, 3)) + .(round(slo, 3))~"* Infection"),
    size = 6)
Warning in is.na(x): is.na() applied to non-(list or vector) of type 'language'